Abstract. We suggest a variation of the Hellerstein-Koutsoupias-Papadimitriou indexability model
for datasets equipped with a similarity measure, with the aim of better understanding the structure
of indexing schemes for similarity-based search and the geometry of similarity workloads. This in
particular provides a unified approach to a great variety of schemes used to index into metric spaces
and facilitates their transfer to more general similarity measures such as quasi-metrics. We discuss
links between performance of indexing schemes and high-dimensional geometry. The concepts and
results are illustrated on a very large concrete dataset of peptide fragments equipped with a
biologically significant similarity measure.


Keywords: Similarity workload, metrics, quasi-metrics, indexing schemes, the curse of dimensionality


1. Introduction 

Indexing into very large datasets with the aim of fast similarity search remains a challenging and
often elusive problem of data engineering. The main motivation for the present work comes from
sequence-based biology, where high-speed access methods for biological sequence databases will be
vital both for developing large-scale datamining projects [?] and for testing the nascent
mathematical conceptual models [?].

What is needed is a fully developed mathematical paradigm of indexability for similarity search that
would incorporate the existing structures of database theory and possess predictive power. While the
fundamental building blocks (similarity measures, data distributions, hierarchical tree index
structures, and so forth) are in plain view, the only way they can be assembled together is by
examining concrete datasets of importance and taking one step at a time. Theoretical developments and
massive amounts of computational work must proceed in parallel; generally, we share the philosophy
espoused in [13].
The master concept was introduced in the paper [?] (cf. also [?]): a workload, $W$, is a triple
consisting of a search domain $\Omega$, a dataset $X$, and a set of queries, $\mathcal{Q}$. An
indexing scheme according to [?] is just a collection of blocks covering $X$. While this concept is
fully adequate for many aspects of theory, we believe that analysis of indexing schemes for
similarity search, with its strong geometric flavour, requires a more structured approach, and so we
put forward a concept of an indexing scheme as a system of blocks equipped with a tree-like search
structure and decision functions at each step. We also suggest the notion of a reduction of one
workload to another, allowing one to create new access methods from the existing ones. One example is
the new concept of a quasi-metric tree, proposed here. We discuss how geometry of high dimensions
(asymptotic geometric analysis) may offer a constructive insight into the nature of the curse of
dimensionality.

Our concepts and results are illustrated throughout on a concrete dataset of short peptide fragments, 
containing nearly 24 million data points and equipped with a biologically significant similarity measure. 
In particular, we construct a quasi-metric tree index structure into our dataset, based on a known idea in 
molecular biology. Even if intended as a mere illustration and a building block for more sophisticated 
approaches, this scheme outputs 100 nearest neighbours from the actual dataset to any one of the $20^{10}$
virtual peptide fragments through scanning on average 0.53 %, and at most 3.5 %, of data.

2. Workloads 

2.1. Definition and basic examples

A workload [?] is a triple $W = (\Omega, X, \mathcal{Q})$, where $\Omega$ is the domain, $X$ is a
finite subset of the domain (dataset, or instance), and $\mathcal{Q} \subseteq 2^{\Omega}$ is the set
of queries, that is, some specified subsets of $\Omega$. Answering a query $Q \in \mathcal{Q}$ means
listing all datapoints $x \in X \cap Q$.
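As a minimal sketch of this definition, a query can be represented by its membership test and answered by a plain scan of the dataset. The toy dataset, the integer domain and the helper names `answer_query` and `interval` are assumptions made for illustration only; they are not taken from the paper.

```python
from typing import Callable, Iterable, List


def answer_query(dataset: Iterable[int], in_query: Callable[[int], bool]) -> List[int]:
    """List all datapoints x in X ∩ Q, where Q is given by its membership test."""
    return [x for x in dataset if in_query(x)]


# Hypothetical toy workload: the domain is the integers, X is a small finite
# dataset, and the queries are closed intervals [a, b].
X = [1, 5, 8, 12]


def interval(a: int, b: int) -> Callable[[int], bool]:
    """The query Q = [a, b], represented by its membership test."""
    return lambda x: a <= x <= b


print(answer_query(X, interval(4, 9)))  # [5, 8]
```

This scan touches every datapoint; the indexing schemes of Section 3 aim to answer the same query while inspecting only part of $X$.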

Example 2.1. The trivial workload: $\Omega = X = \{*\}$ is a one-element set, with a sole possible
query, $Q = \{*\}$.


Example 2.2. Let $X \subseteq \Omega$ be a dataset. Exact match queries for $X$ are singletons, that
is, sets $Q = \{\omega\}$, $\omega \in \Omega$.

Example 2.3. Let $W_i = (\Omega_i, X_i, \mathcal{Q}_i)$, $i = 1, 2, \ldots, n$, be a finite
collection of workloads. Their disjoint sum is a workload $W = \sqcup_{i=1}^{n} W_i$, whose domain is
the disjoint union $\Omega = \Omega_1 \sqcup \Omega_2 \sqcup \ldots \sqcup \Omega_n$, the dataset is
the disjoint union $X = X_1 \sqcup X_2 \sqcup \ldots \sqcup X_n$, and the queries are of the form
$Q_1 \sqcup Q_2 \sqcup \ldots \sqcup Q_n$, where $Q_i \in \mathcal{Q}_i$, $i = 1, 2, \ldots, n$.

2.2. Similarity queries 

A (dis)similarity measure on a set $\Omega$ is a function of two variables
$s\colon \Omega \times \Omega \to \mathbb{R}$, possibly subject to additional properties. A range
similarity query centred at $x^* \in \Omega$ consists of all $x \in \Omega$ determined by the
inequality $s(x^*, x) < K$ or $> K$, depending on the type of similarity measure.
Figure 1. BLOSUM62 asymmetric distances. Distances within members of the alphabet partition used for
indexing (cf. Subsect. 3.4 below) are greyed.


A similarity workload is a workload whose queries are generated by a similarity measure. Different
similarity measures, $S_1$ and $S_2$, on the same domain $\Omega$ can result in the same set of
queries, $\mathcal{Q}$, in which case we will call $S_1$ and $S_2$ equivalent.

Metrics are among the best known similarity measures. A similarity measure $d(x, y) \geq 0$ is called
a quasi-metric if it satisfies $d(x, y) = 0 \Leftrightarrow x = y$ and the triangle inequality, but
is not necessarily symmetric.
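The quasi-metric axioms can be checked by brute force on a small finite set. The sketch below is only an illustration: the function names and the "clockwise distance on a cycle" example are assumptions chosen for this note, not constructions from the paper.

```python
from itertools import product


def is_quasi_metric(points, d):
    """Check d >= 0, d(x, y) = 0 iff x = y, and the triangle inequality."""
    pts = list(points)
    for x, y in product(pts, repeat=2):
        if d(x, y) < 0 or (d(x, y) == 0) != (x == y):
            return False
    return all(d(x, z) <= d(x, y) + d(y, z) for x, y, z in product(pts, repeat=3))


def is_symmetric(points, d):
    """Check whether d(x, y) = d(y, x) for all pairs."""
    pts = list(points)
    return all(d(x, y) == d(y, x) for x, y in product(pts, repeat=2))


def clockwise(x, y):
    """Toy example: number of clockwise steps from x to y on a cycle of 5 points."""
    return (y - x) % 5


print(is_quasi_metric(range(5), clockwise), is_symmetric(range(5), clockwise))  # True False
```

The clockwise distance satisfies the triangle inequality but $d(0, 1) = 1$ while $d(1, 0) = 4$, so it is a quasi-metric that is not a metric.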


2.3. Illustration: short protein fragments 

The domain $\Omega$ consists of strings of length $m = 10$ from the alphabet $\Sigma$ of 20 standard
amino acids: $\Omega = \Sigma^{10}$.

The dataset $X$ is formed by all peptide fragments of length 10 contained in the SwissProt database
[?] of protein sequences of a variety of biological species (release 40.30 of 19-Oct-2002). The
fragments containing parts of low-complexity segments masked by the SEG program [?], as well as the
fragments containing non-standard letters, were removed. The size of the filtered set is
$|X| = 23{,}817{,}598$ unique fragments (31,380,596 total fragments).

The most commonly used scoring matrix in sequence comparison, BLOSUM62 [?], serves as the similarity
measure $s$ on the alphabet $\Sigma$, and is extended over the domain $\Sigma^m$ via
$S(a, b) = \sum_{i=1}^{m} s(a_i, b_i)$ (the ungapped score).

The formula $d(a, b) = s(a, a) - s(a, b)$, $a, b \in \Sigma$, applied to the similarity measure given
by BLOSUM62, as well as to most other matrices from the BLOSUM family, yields a quasi-metric on
$\Sigma$ (Figure 1). One can now prove that the quasi-metric $d$ on the domain given by
$d(a, b) = \sum_{i=1}^{m} d(a_i, b_i)$ is equivalent to the similarity measure $S$.
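The passage from a score matrix to a quasi-metric can be sketched on a toy alphabet. The three-letter score table below uses hypothetical values chosen for illustration (they are not BLOSUM62 entries), and the function names are assumptions of this sketch. Since $d(u, v) = S(u, u) - S(u, v)$ for strings, ranking candidates by increasing $d$ from a fixed query agrees with ranking them by decreasing $S$, which is the sense in which the two measures produce the same queries.

```python
from itertools import product

# Hypothetical symmetric score matrix on a 3-letter alphabet (NOT BLOSUM62 values).
SCORE = {("A", "A"): 4, ("B", "B"): 6, ("C", "C"): 5,
         ("A", "B"): 1, ("A", "C"): 0, ("B", "C"): 2}


def s(a, b):
    """Similarity score of two letters."""
    return SCORE[(a, b)] if (a, b) in SCORE else SCORE[(b, a)]


def d_letter(a, b):
    """Letter distance d(a, b) = s(a, a) - s(a, b); asymmetric in general."""
    return s(a, a) - s(a, b)


def S(u, v):
    """Ungapped score of two strings of equal length."""
    return sum(s(a, b) for a, b in zip(u, v))


def d(u, v):
    """Distance on strings: sum of letter distances."""
    return sum(d_letter(a, b) for a, b in zip(u, v))


print(d_letter("A", "B"), d_letter("B", "A"))        # 3 5
print(d("AA", "BB"), S("AA", "AA") - S("AA", "BB"))  # 6 6
```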
Figure 2. Growth with regard to the product measure of $\varepsilon$-neighbourhoods of our
illustrative dataset in $\Omega = \Sigma^{10}$. The $\varepsilon$-neighbourhoods are formed with
regard to the quasi-metric $d$ (Subsect. 2.3) and the smallest metric majorizing $d$ (Ex. 4.6 below).

2.4. Inner and outer workloads 

We call a workload $W$ inner if $X = \Omega$, otherwise $W$ is outer. Typically, for outer workloads
$|X| \ll |\Omega|$.

Example 2.4. Our illustrative workload is outer, with the ratio
$|X|/|\Omega| = 23{,}817{,}598 / 20^{10} \approx 0.0000023$.

Moreover, Fig. 2 shows that an overwhelming number of points $\omega \in \Omega$ have neighbours
$x \in X$ within the distance of $\varepsilon = 25$, which on average indicates high biological
relevance. For this reason, most of the possible queries $Q = B_\varepsilon(\omega)$ are meaningful,
and our illustrative workload is indeed outer in a fundamental way.

The difference between inner and outer searches is particularly significant for similarity searches, 
and is often underestimated. 

In theory, every workload $W = (\Omega, X, \mathcal{Q})$ can be replaced with an inner workload
$(X, X, \mathcal{Q}|_X)$, where the new set of queries $\mathcal{Q}|_X$ consists of sets $Q \cap X$,
$Q \in \mathcal{Q}$. However, in practical terms this reduction often makes little sense because of
the prohibitively high complexity of storing and processing the query sets $Q \cap X$.

3. Indexing schemes 

3.1. Basic concepts and examples 

An access method for a workload $W$ is an algorithm that on an input $Q \in \mathcal{Q}$ outputs all
elements of $Q \cap X$. Typical access methods come from indexing schemes.

For a rooted finite tree $T$, we denote by $L(T)$ the set of leaf nodes and by $I(T)$ the set of
inner nodes of $T$. The notation $t \in T$ will mean that $t$ is a node of $T$, and $C_t$ will denote
the set of all children of a $t \in I(T)$, while the parent of $t$ will be denoted $p(t)$.

Definition 3.1. Let $W = (\Omega, X, \mathcal{Q})$ be a workload. An indexing scheme on $W$ is a
triple $\mathcal{I} = (T, \mathcal{B}, \mathcal{F})$, where

• $T$ is a rooted finite tree, with root node $*$,

• $\mathcal{B}$ is a collection of subsets $B_t \subseteq \Omega$ (blocks, or bins), where
$t \in L(T)$,

• $\mathcal{F} = \{F_t : t \in I(T)\}$ is a collection of set-valued decision functions,
$F_t\colon \mathcal{Q} \to 2^{C_t}$, where each value $F_t(Q) \subseteq C_t$ is a subset of children
of the node $t$.

Definition 3.2. An indexing scheme $\mathcal{I} = (T, \mathcal{B}, \mathcal{F})$ for a workload
$W = (\Omega, X, \mathcal{Q})$ will be called consistent if the following is an access method.

Algorithm 3.1.

on input $Q$ do
    set $A_0 = \{*\}$
    for each $i = 0, 1, \ldots$ do
        if $A_i \neq \emptyset$
        then for each $t \in A_i$ do
            if $t$ is not a leaf node
            then $A_{i+1} \leftarrow A_{i+1} \cup F_t(Q)$
            else for each $x \in B_t$ do
                if $x \in Q$
                then $A \leftarrow A \cup \{x\}$
    return $A$
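The level-by-level traversal of Algorithm 3.1 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: each bin is assumed to store only the datapoints it contains (that is, $B_t \cap X$), a query is represented as a finite Python set, and the names `run_scheme`, `BINS` and `decide` are hypothetical. The toy scheme is the hashing scheme of depth one, with integers hashed by their residue modulo 3.

```python
def run_scheme(root, bins, decide, query):
    """Traverse the tree level by level, following the decision functions.

    bins   : dict mapping each leaf node to the datapoints stored in its bin
    decide : decide(t, query) returns the children of inner node t to visit
    query  : a finite set of domain points
    """
    level = {root}          # the set A_i of nodes reached at depth i
    found = set()           # the answer set A
    while level:
        next_level = set()
        for t in level:
            if t in bins:                       # leaf node: scan its bin
                found.update(x for x in bins[t] if x in query)
            else:                               # inner node: branch
                next_level.update(decide(t, query))
        level = next_level
    return found


# Toy hashing scheme: the root "*" has three leaf children 0, 1, 2.
BINS = {0: [6, 9], 1: [1, 4, 7], 2: []}


def decide(t, query):
    """Decision function at the root: the bins that elements of the query hash to."""
    return {x % 3 for x in query}


print(sorted(run_scheme("*", BINS, decide, {4, 5, 9})))  # [4, 9]
```

For a query that hashes into a single residue class, only one of the three bins is scanned.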

The following is an obvious and easy to verify sufficient condition for consistency.

Proposition 3.1. An indexing scheme $\mathcal{I} = (T, \mathcal{B}, \mathcal{F})$ for a workload
$W = (\Omega, X, \mathcal{Q})$ is consistent if for every $Q \in \mathcal{Q}$ and for every
$x \in Q \cap X$ there exists $t \in L(T)$ such that $x \in B_t$ and the path $s_0 s_1 \ldots s_m$,
where $s_0 = *$, $s_m = t$ and $s_i = p(s_{i+1})$, satisfies $s_{i+1} \in F_{s_i}(Q)$ for all
$i = 0, 1, \ldots, m - 1$.

In the future we will be considering consistent indexing schemes only.

Example 3.1. A simple linear scan of a dataset $X$ corresponds to the indexing scheme where
$T = \{*, \star\}$ has a root and a single child, $\mathcal{B}$ consists of a single block
$B_\star = \Omega$, and the decision function $F_*$ always outputs the same value $\{\star\}$.

Example 3.2. Hashing can be described in terms of the following indexing scheme. The tree $T$ has
depth one, with its leaves corresponding to bins, and the decision function $F_*$ on an input $Q$
outputs the entire family of bins in which elements of $Q \cap X$ are stored.

Example 3.3. If the domain is linearly ordered (for instance, assume $\Omega = \mathbb{R}$) and the
set of queries consists of intervals $[a, b]$, $a, b \in \Omega$, then a well-known and efficient
indexing structure is constructed using a binary tree. The nodes $t$ of $T$ can be identified with
elements of $\Omega$ chosen so that the tree is balanced. Each decision function $F_t$ on an input
$[a, b]$ outputs the set of all children nodes $s$ of $t$ satisfying

$((t - a)(s - a) > 0) \wedge ((t - b)(s - b) > 0)$.
Remark 3.1. The computational complexity of the decision functions $F_t(Q)$, as well as the amount of
'branching' resulting from an application of Algorithm 3.1, become major efficiency factors in case
of similarity-based search, which is why we feel they should be brought into the picture.

3.2. Metric trees 

Let $(\Omega, X, \rho)$ be a similarity workload, where $\rho$ is a metric, that is, each query
$Q = B_\varepsilon(\omega)$ is a ball of radius $\varepsilon > 0$ around the query centre
$\omega \in \Omega$.

A metric tree is an indexing structure into $(\Omega, X, \rho)$ where the decision functions are of
the form

$F_t(B_\varepsilon(\omega)) = \{ s \in C_t : f_s(\omega) \leq \varepsilon \}$    (1)

for suitable 1-Lipschitz functions $f_s\colon \Omega \to \mathbb{R}$, one for each node $s \in T$.
(Recall that $f\colon \Omega \to \mathbb{R}$ is 1-Lipschitz if $|f(x) - f(y)| \leq \rho(x, y)$ for
each $x, y \in \Omega$.) We call those $f_t$ certification functions. The set
$F_t(B_\varepsilon(\omega))$ is output by scanning all children $s$ of $t$ and accepting or rejecting
them in accordance with the above criterion.

Theorem 3.2. Let $W = (\Omega, X, \rho)$ be a metric similarity workload. Let $T$ be a finite rooted
tree, and let $B_t$, $t \in T$, be a collection of subsets of $\Omega$ (blocks), covering $X$ and
having the property that $X \subseteq \bigcup_{t \in L(T)} B_t \subseteq \Omega$ and for every inner
node $t$, $\bigcup_{s \in C_t} (B_s \cap X) \subseteq B_t$. Let $f_t\colon \Omega \to \mathbb{R}$ be
1-Lipschitz functions with the property $(\omega \in B_t) \Rightarrow (f_t(\omega) \leq 0)$. Define
decision functions $F_t$ as in Eq. (1). Then the triple
$(T, \{B_t\}_{t \in L(T)}, \{F_t\}_{t \in I(T)})$ is a consistent indexing scheme for $W$.

We omit the proof because a more general result (Theorem 3.3) is proved below.

1-Lipschitz functions $f_t$ with the property required by the assumptions of Theorem 3.2 always
exist. Once the collection $B_t$, $t \in T$, of blocks has been chosen, put

$f_t(\omega) = \rho(B_t, \omega) := \inf_{x \in B_t} \rho(x, \omega)$,

the distance from a block $B_t$ to an $\omega$. However, such distance functions from sets are
typically computationally very expensive. The art of constructing a metric tree consists in choosing
computationally inexpensive certification functions that at the same time don't result in an
excessive amount of branching.

Example 3.4. The GNAT indexing scheme [?] uses certification functions of the form

$f_{t_\pm}(\omega) = \pm(\rho(\omega, x_t) - M_t)$,

where $x_t$ is a datapoint chosen for the node $t$, $M_t$ is the median value for the function
$\omega \mapsto \rho(\omega, x_t)$, and $t_\pm$ are two children of $t$.

Example 3.5. The vp-tree [?] uses certification functions of the form

$f_t(\omega) = (1/2)(\rho(x_{t_+}, \omega) - \rho(x_{t_-}, \omega))$,

where again $t_\pm$ are two children of $t$ and $x_{t_\pm}$ are the vantage points for the node $t$.
Figure 3. A metric tree indexing scheme. To retrieve the shaded range query the nodes above the dashed line
must be scanned; the branches below can be pruned.


Example 3.6. The M-tree [?] employs, as certification functions, those of the form

$f_t(\omega) = \rho(x_t, \omega) - \sup_{\tau \in B_t} \rho(x_t, \tau)$,

where $B_t$ is a block corresponding to the node $t$, $x_t$ is a datapoint chosen for each node $t$,
and the suprema on the r.h.s. are precomputed and stored.
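A one-level sketch of pruning with a certification function of this form is given below, for points on the real line with $\rho(x, y) = |x - y|$. It is an illustration only and not an M-tree implementation: the grouping of points into leaves, the choice of the first point as $x_t$, and the function names are assumptions, and the supremum is taken over the stored datapoints of each leaf.

```python
def rho(x, y):
    """Metric on the real line."""
    return abs(x - y)


def build_leaves(groups):
    """For each group of points store (pivot x_t, covering radius, points)."""
    leaves = []
    for pts in groups:
        pivot = pts[0]
        radius = max(rho(pivot, p) for p in pts)   # precomputed supremum
        leaves.append((pivot, radius, pts))
    return leaves


def range_query(leaves, omega, eps):
    """Return (sorted hits, number of leaves visited) for the ball B_eps(omega)."""
    hits, visited = [], 0
    for pivot, radius, pts in leaves:
        # Certification function f_t(omega) = rho(x_t, omega) - radius;
        # the leaf is visited only if f_t(omega) <= eps.
        if rho(pivot, omega) - radius <= eps:
            visited += 1
            hits.extend(p for p in pts if rho(omega, p) <= eps)
    return sorted(hits), visited


GROUPS = [[0, 1, 2], [10, 11, 13], [20, 22]]
LEAVES = build_leaves(GROUPS)
print(range_query(LEAVES, 12, 1.5))  # ([11, 13], 1)
```

No hit is missed because, by the triangle inequality, any stored point $p$ of a leaf with $\rho(\omega, p) \leq \varepsilon$ forces $\rho(x_t, \omega) \leq \rho(x_t, p) + \varepsilon$, so the leaf passes the test.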

There are many other examples of metric trees, e.g. k-d tree, gh-tree, mvp-tree, etc. [?]. They
all seem to fit into the concept of a general metric tree equipped with 1-Lipschitz certification
functions, first formulated in the present exact form in [?].

Example 3.7. Suppose $\Omega = X = \{0, 1\}^m$, the set of all binary strings of length $m$. The
Hamming distance between two strings $x$ and $y$ is the number of terms where $x$ and $y$ differ. A
$k$-neighbourhood of any point with respect to the Hamming distance can be output by a combinatorial
generation algorithm such as traversing the binomial tree of order $m$ to depth $k$.
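A minimal sketch of such a combinatorial generation is shown below. Instead of an explicit binomial tree traversal it uses `itertools.combinations` to enumerate the sets of at most $k$ flipped positions, which yields the same neighbourhood; the function name `hamming_ball` is an assumption of this sketch.

```python
from itertools import combinations


def hamming_ball(x, k):
    """All binary strings (as tuples of 0/1) within Hamming distance k of x."""
    m = len(x)
    ball = []
    for r in range(k + 1):                       # number of flipped positions
        for positions in combinations(range(m), r):
            y = list(x)
            for i in positions:
                y[i] ^= 1                        # flip the chosen bit
            ball.append(tuple(y))
    return ball


print(len(hamming_ball((0, 0, 0, 0), 2)))  # 11 = C(4,0) + C(4,1) + C(4,2)
```

The neighbourhood is produced without computing any distances to points outside it, which is what makes this workload easy to index.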

3.3. Quasi-metric trees 

Quasi-metrics often appear as similarity measures on datasets, and even if they are being routinely
replaced with metrics by way of what we call a projective reduction of a workload (Ex. 4.6), this
may result in a loss of performance (cf. Ex. 5.2). It is therefore desirable to develop a theory of
indexability for quasi-metric spaces.

The concept of a 1-Lipschitz function is no longer adequate. Indeed, a 1-Lipschitz function
$f\colon \Omega \to \mathbb{R}$ remains such with regard to the metric
$d(x, y) = \max\{\rho(x, y), \rho(y, x)\}$ on $\Omega$, and so using 1-Lipschitz functions for
indexing in effect amounts to replacing $\rho$ with a coarser metric $d$. A subtler concept becomes
necessary.

Definition 3.3. Call a function $f$ on a quasi-metric space $(\Omega, \rho)$ left 1-Lipschitz if for
all $x, y \in \Omega$

$f(x) - f(y) \leq \rho(x, y)$,

and right 1-Lipschitz if $f(y) - f(x) \leq \rho(x, y)$.

Example 3.8. Let $A$ be a subset of a quasi-metric space $(\Omega, \rho)$. The distance function from
$A$ computed on the left, $d(x, A) = \inf\{\rho(x, a) : a \in A\}$, is left 1-Lipschitz, while the
function $d(A, x)$ is right 1-Lipschitz.
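The claim of Example 3.8 follows from the triangle inequality alone. A short derivation, written out here for convenience and assuming $A$ is non-empty and $d(A, x) = \inf\{\rho(a, x) : a \in A\}$, reads:

```latex
\begin{align*}
\rho(x,a) &\le \rho(x,y) + \rho(y,a) && \text{for every } a \in A,\\
d(x,A) = \inf_{a \in A} \rho(x,a) &\le \rho(x,y) + \inf_{a \in A} \rho(y,a)
      = \rho(x,y) + d(y,A),\\
\text{hence}\quad d(x,A) - d(y,A) &\le \rho(x,y) && \text{(left 1-Lipschitz)};\\[4pt]
\rho(a,y) &\le \rho(a,x) + \rho(x,y) && \text{for every } a \in A,\\
\text{hence}\quad d(A,y) - d(A,x) &\le \rho(x,y) && \text{(right 1-Lipschitz)}.
\end{align*}
```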

Now one can establish a quasi-metric (hence more general) analog of Theorem 3.2.

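Theorem 3.3. Let $W = (\Omega, X, \rho)$ be a quasi-metric similarity workload. Let $T$ be a finite
rooted tree, and let $B_t$, $t \in T$, be blocks covering $X$ in such a way that
$X \subseteq \bigcup_{t \in L(T)} B_t \subseteq \Omega$ and for every inner node $t$,
$\bigcup_{s \in C_t} (B_s \cap X) \subseteq B_t$. Let $f_t\colon \Omega \to \mathbb{R}$ be left
1-Lipschitz functions such that $(\omega \in B_t) \Rightarrow (f_t(\omega) \leq 0)$, $t \in T$.
Define decision functions $F_t$ as in Eq. (1). Then the triple
$(T, \{B_t\}_{t \in L(T)}, \{F_t\}_{t \in I(T)})$ is a consistent indexing scheme for $W$.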

Proof:

Let $x \in Q \cap X = B_\varepsilon(\omega) \cap X$. By the first covering assumption above, there
exists a leaf node $t$ such that $x \in B_t$. Consider the path $s_0 s_1 \ldots s_m$, where
$s_0 = *$, $s_m = t$ and $s_i = p(s_{i+1})$, from root to $t$. By the second covering assumption
above, for each $i = 1, 2, \ldots, m$, we have
$(B_t \cap X) \subseteq (B_{s_i} \cap X) \subseteq B_{s_{i-1}}$ and hence $x \in B_{s_i}$. It follows
that $f_{s_i}(x) \leq 0$ and, since $f_{s_i}$ is a left 1-Lipschitz function, we have

$f_{s_i}(\omega) \leq f_{s_i}(\omega) - f_{s_i}(x) \leq \rho(\omega, x) \leq \varepsilon$.

Therefore, $s_i \in F_{s_{i-1}}(Q)$ and consistency follows by Proposition 3.1. □

Example 3.9. Many of the particular types of metric trees generalize to a quasi-metric setting. For
instance, M-tree (Ex. 3.6) leads to an indexing scheme into quasi-metric spaces if the certification
functions are chosen as

$f_t(\omega) = \rho(\omega, x_t) - \sup_{\tau \in B_t} \rho(\tau, x_t)$,

where $B_t$ and $x_t$ are as in Ex. 3.6.

3.4. Illustration: a quasi-metric tree for protein fragments 

Here is a simple but rather efficient implementation of a quasi-metric tree on our workload of
peptide fragments (Subs. 2.3).

Let $\Sigma$, $\Omega = \Sigma^m$, and $d$ be as in Subs. 2.3. Let $\gamma$ be a partition of the
alphabet $\Sigma$, that is, a finite collection of disjoint subsets covering $\Sigma$. Denote by $T$
the prefix tree of $\gamma^m$, that is, nodes of $T$ are strings of the form
$t = A_1 A_2 \ldots A_l$, where $A_i \in \gamma$, $i = 1, 2, \ldots, l$, $l \leq m$, and the children
of $t$ are all strings of length $l + 1$ having $t$ as a prefix. To every $t$ as above assign a
cylinder subset $B_t \subseteq \Omega$, consisting of all strings $\omega \in \Sigma^m$ such that
$\omega_i \in A_i$, $i = 1, 2, \ldots, l$.

Figure 4. Distribution of bin sizes (3,455,126 empty bins out of 9,765,625 total).

The certification function $f_t$ for the node $t$ is the distance from the cylinder $B_t$, computed
on the left: $f_t(\omega) := d(\omega, B_t)$. The value of $f_t$ at any $\omega$ can be computed
efficiently using precomputed and stored values of distances from each $a \in \Sigma$ to every
$A \in \gamma$. The construction of a quasi-metric tree indexing into $\Omega$ is accomplished as in
Th. 3.3.
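The construction can be sketched on a toy scale as follows. Everything concrete in this block is a hypothetical stand-in chosen for illustration: a four-letter alphabet, a partition into two groups, a hand-made asymmetric letter distance table, strings of length 3, and the function names. The certification value of a prefix node is the sum, over the positions fixed so far, of the precomputed left distance from the query letter to the group at that position.

```python
from itertools import product

# Hypothetical toy alphabet, partition and letter quasi-metric (not derived from BLOSUM62).
GROUPS = ("AB", "CD")                                   # the partition of the alphabet
GROUP_OF = {c: g for g, letters in enumerate(GROUPS) for c in letters}
D = {
    "A": {"A": 0, "B": 1, "C": 4, "D": 5},
    "B": {"A": 2, "B": 0, "C": 3, "D": 4},
    "C": {"A": 5, "B": 4, "C": 0, "D": 1},
    "D": {"A": 4, "B": 5, "C": 2, "D": 0},
}
# Precomputed left distance from each letter a to each group: min of d(a, b) over b in the group.
TO_GROUP = {a: [min(D[a][b] for b in letters) for letters in GROUPS] for a in D}


def d(u, v):
    """Quasi-metric on strings: sum of letter distances."""
    return sum(D[a][b] for a, b in zip(u, v))


def build_bins(dataset):
    """Store each string in the bin labelled by its sequence of group indices."""
    bins = {}
    for x in dataset:
        bins.setdefault(tuple(GROUP_OF[c] for c in x), []).append(x)
    return bins


def range_search(bins, omega, eps):
    """Return (sorted hits, number of datapoints scanned) for the left ball of radius eps."""
    m = len(omega)
    hits, scanned = [], 0
    stack = [((), 0)]                                   # (node t as a prefix, f_t(omega))
    while stack:
        prefix, f = stack.pop()
        if len(prefix) == m:                            # leaf node: scan its bin
            for x in bins.get(prefix, []):
                scanned += 1
                if d(omega, x) <= eps:
                    hits.append(x)
            continue
        letter = omega[len(prefix)]
        for g in range(len(GROUPS)):
            f_child = f + TO_GROUP[letter][g]
            if f_child <= eps:                          # keep child s only if f_s(omega) <= eps
                stack.append((prefix + (g,), f_child))
    return sorted(hits), scanned


WORDS = ["".join(p) for p in product("ABCD", repeat=3)]
DATASET = WORDS[::3]                                    # an arbitrary toy dataset
BINS = build_bins(DATASET)
hits, scanned = range_search(BINS, "ABC", 2)
print(hits, scanned, len(DATASET))
```

Pruning is safe because $f_t(\omega) = d(\omega, B_t) \leq d(\omega, x)$ for every $x \in B_t$, so a bin containing a hit is never rejected.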

In our case, the standard amino acid alphabet is partitioned into five groups (Figure 1) based on
some known classification approaches to amino acids from biochemistry. This partition induces a
partition of $\Omega = \Sigma^{10}$ into $5^{10} = 9{,}765{,}625$ bins.

Since $X$ contains 23,817,598 datapoints, there are on average 2.4 points per bin. The actual
distribution of bin sizes is strongly skewed in favour of small sizes (Fig. 4) and appears to follow
the DGX distribution described in [3].

The performance of our indexing scheme is reflected in Fig. 5. Recall that an indexing scheme for
similarity search that reduces the fraction of data scanned to below 10 % is already considered successful. 
Our figures are many times lower. 

Remark 3.2. While other partitions of $\Sigma$ producing different indexing schemes are certainly possible,
ours can be used for searches based on other BLOSUM matrices with little loss of efficiency, because 
most amino acid scoring matrices used in practice reflect chemical and functional properties of amino 
acids and hence produce very similar collections of queries. 


4. New indexing schemes from old 

4.1. Disjoint sums 

Any collection of access methods for workloads $W_1, W_2, \ldots, W_n$ leads to an access method for
the disjoint sum workload $\sqcup_{i=1}^{n} W_i$: to answer a query $Q = \sqcup_{i=1}^{n} Q_i$, it
suffices to answer each query $Q_i$, $i = 1, 2, \ldots, n$, and then merge the outputs.





V. Pestov, A. Stojmirovic /Indexing schemes for similarity search 



Figure 5. Percentage of dataset points scanned to obtain k nearest neighbours. Based on 20000 searches for each 
k. Query points were sampled with respect to the product measure based on amino acid frequencies. 


In particular, if each $W_i$ is equipped with an indexing scheme,
$\mathcal{I}_i = (T_i, \mathcal{B}_i, \mathcal{F}_i)$, then a new indexing scheme for
$\sqcup_{i=1}^{n} W_i$, denoted $\mathcal{I} = \sqcup_{i=1}^{n} \mathcal{I}_i$, is constructed as
follows: the tree $T$ contains all $T_i$'s as branches beginning at the root node, while the families
of bins and of certification functions for $\mathcal{I}$ are unions of the respective collections for
all $\mathcal{I}_i$, $i = 1, 2, \ldots, n$.

4.2. Inductive reduction 

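Let $W_i = (\Omega_i, X_i, \mathcal{Q}_i)$, $i = 1, 2$, be two workloads. An inductive reduction of
$W_1$ to $W_2$ is a pair of mappings $i\colon \Omega_2 \to \Omega_1$,
$i^{\leftarrow}\colon \mathcal{Q}_1 \to \mathcal{Q}_2$, such that

• $i(X_2) \supseteq X_1$,

• for each $Q \in \mathcal{Q}_1$, $i^{-1}(Q) \subseteq i^{\leftarrow}(Q)$.

Notation: $W_2 \xrightarrow{i} W_1$.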

An access method for $W_2$ leads to an access method for $W_1$, where a query $Q \in \mathcal{Q}_1$
is answered as follows:

on input $Q$ do
    answer the query $i^{\leftarrow}(Q)$
    for each $y \in X_2 \cap i^{\leftarrow}(Q)$ do
        if $i(y) \in Q$
        then add $x = i(y)$ on the list $A$
    return $A$
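The access method obtained from an inductive reduction can be sketched as below. The toy workloads are assumptions made for illustration: $X_2 = \{0, \ldots, 19\}$, the map $i(y) = \lfloor y/2 \rfloor$ onto $X_1 = \{0, \ldots, 9\}$, and interval queries, with $[a, b]$ pulled back to $[2a, 2b + 1]$. All function names are hypothetical.

```python
def answer_by_inductive_reduction(answer_w2, i, pull_query, in_query, query):
    """Answer a query of W1 through an access method for W2."""
    result = set()
    for y in answer_w2(pull_query(query)):     # y ranges over X2 ∩ (pulled-back query)
        x = i(y)
        if in_query(x, query):                 # keep x = i(y) only if it lies in Q
            result.add(x)
    return result


# Toy workloads with interval queries, represented as pairs (a, b).
X2 = range(20)


def i(y):
    """The map from the domain of W2 to the domain of W1."""
    return y // 2


def pull_query(q):
    """Pulled-back query: contains the preimage of [a, b] under i."""
    a, b = q
    return (2 * a, 2 * b + 1)


def answer_w2(q2):
    """Access method for W2 (here a plain scan)."""
    lo, hi = q2
    return [y for y in X2 if lo <= y <= hi]


def in_query(x, q):
    a, b = q
    return a <= x <= b


print(sorted(answer_by_inductive_reduction(answer_w2, i, pull_query, in_query, (3, 5))))  # [3, 4, 5]
```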

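If $\mathcal{I}_2 = (T_2, \mathcal{B}_2, \mathcal{F}_2)$ is a consistent indexing scheme for $W_2$,
then a consistent indexing scheme $\mathcal{I}_1 = i_*(\mathcal{I}_2)$ for $W_1$ is constructed by
taking $T_1 = T_2$, $B_t^{(1)} = i(B_t^{(2)})$, and
$F_t^{(1)}(Q) = F_t^{(2)}(i^{\leftarrow}(Q))$ (the upper index $i = 1, 2$ refers to the two
workloads).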
Example 4.1. Let $\Gamma$ be a finite graph of bounded degree, $d$. Associate to it a graph workload,
$W_\Gamma$, which is an inner workload with $X = V_\Gamma$, the set of vertices, and a $k$-nearest
neighbour query consists in finding $k$ nearest neighbours of a vertex.

A linear forest is a graph that is a disjoint union of paths. The linear arboricity, $la(\Gamma)$, of
a graph $\Gamma$ is the smallest number of linear forests whose union is $\Gamma$. This number is, in
fact, fairly small: it does not exceed $\lceil 3d/5 \rceil$, where $d$ is the degree of $\Gamma$ [?].
This concept leads to an indexing scheme for the graph workload $W_\Gamma$, as follows.

Let $F_i$, $i = 1, \ldots, la(\Gamma)$, be linear forests. Denote
$F = \sqcup_{i=1}^{la(\Gamma)} F_i$, and let $\phi\colon F \to \Gamma$ be a surjective map preserving
the adjacency relation. Every linear forest can be ordered, and indexed into as in Ex. 3.3. At the
next step, index into the disjoint sum $F$ as in Subs. 4.1. Finally, index into $\Gamma$ using the
inductive reduction $\phi\colon F \to \Gamma$. This indexing scheme outputs nearest neighbours of any
vertex of $\Gamma$ in time $O(d \log n)$, requiring storage space $O(n)$, where $n$ is the number of
vertices in $\Gamma$.

Of course the similarity workload of the above type is essentially inner. 


4.3. Projective reduction 

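Let $W_i = (\Omega_i, X_i, \mathcal{Q}_i)$, $i = 1, 2$, be two workloads. A projective reduction of
$W_1$ to $W_2$ is a pair of mappings $r\colon \Omega_1 \to \Omega_2$,
$r^{\rightarrow}\colon \mathcal{Q}_1 \to \mathcal{Q}_2$, such that

• $r(X_1) \subseteq X_2$,

• for each $Q \in \mathcal{Q}_1$, $r(Q) \subseteq r^{\rightarrow}(Q)$.

Notation: $W_1 \xrightarrow{r} W_2$.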

An access method for $W_2$ leads to an access method for $W_1$, where a query $Q \in \mathcal{Q}_1$
is answered as follows:

on input $Q$ do
    answer the query $r^{\rightarrow}(Q)$
    for each $y \in X_2 \cap r^{\rightarrow}(Q)$ do
        for each $x \in r^{-1}(y)$ do
            if $x \in Q$
            then add $x$ on the list $A$
    return $A$
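A sketch of the access method obtained from a projective reduction follows. The toy setting is an assumption made for illustration: integer datapoints are mapped to buckets by $r(x) = \lfloor x/10 \rfloor$, an interval query $[a, b]$ is pushed forward to the bucket range $[r(a), r(b)]$, and the preimage $r^{-1}(y)$ is restricted to stored datapoints. All names are hypothetical.

```python
def answer_by_projective_reduction(answer_w2, preimage, push_query, in_query, query):
    """Answer a query of W1 through an access method for the reduced workload W2."""
    result = []
    for y in answer_w2(push_query(query)):     # y ranges over X2 ∩ (pushed-forward query)
        for x in preimage(y):                  # datapoints of X1 mapped to y
            if in_query(x, query):             # final filtering against the original query
                result.append(x)
    return result


X1 = [3, 12, 18, 25, 31, 39, 47]


def r(x):
    """The reduction map: bucket of width 10."""
    return x // 10


BUCKETS = {}
for x in X1:
    BUCKETS.setdefault(r(x), []).append(x)


def push_query(q):
    """Pushed-forward query: the range of buckets that the interval [a, b] can meet."""
    a, b = q
    return (r(a), r(b))


def answer_w2(q2):
    """Access method for W2: list the non-empty buckets inside the bucket range."""
    lo, hi = q2
    return [y for y in sorted(BUCKETS) if lo <= y <= hi]


def preimage(y):
    return BUCKETS[y]


def in_query(x, q):
    a, b = q
    return a <= x <= b


print(answer_by_projective_reduction(answer_w2, preimage, push_query, in_query, (17, 34)))  # [18, 25, 31]
```

Five datapoints are scanned to return three hits here; this ratio is the kind of quantity measured by the access overhead of Subs. 5.1.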

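Let $\mathcal{I}_2 = (T_2, \mathcal{B}_2, \mathcal{F}_2)$ be a consistent indexing scheme for $W_2$.
The projective reduction $W_1 \xrightarrow{r} W_2$ canonically determines an indexing scheme
$\mathcal{I}_1 = r^*(\mathcal{I}_2)$ as follows: $T_1 = T_2$, $B_t^{(1)} = r^{-1}(B_t^{(2)})$, and
$F_t^{(1)}(Q) = F_t^{(2)}(r^{\rightarrow}(Q))$.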

Example 4.2. The linear scan of a dataset is a projective reduction to the trivial workload: $W \xrightarrow{r} \{*\}$.

If $W = (\Omega, X, \mathcal{Q})$ is a workload and $\Omega'$ is a domain, then every mapping
$r\colon \Omega \to \Omega'$ determines the direct image workload,
$r_*(W) = (\Omega', r(X), r(\mathcal{Q}))$, where $r(X)$ is the image of $X$ under $r$ and
$r(\mathcal{Q})$ is the family of all queries $r(Q)$, $Q \in \mathcal{Q}$.
Example 4.3. Let $\mathcal{B}$ be a finite collection of blocks covering $\Omega$. Define the
discrete workload $(\mathcal{B}, \mathcal{B}, 2^{\mathcal{B}})$, and define the reduction by mapping
each $\omega \in \Omega$ to the corresponding block and defining each $r^{\rightarrow}(Q)$ as the
union of all blocks that meet $Q$. The corresponding reduction forms a basic building block of many
indexing schemes.


Example 4.4. Let $W_i$, $i = 1, 2$, be two metric workloads, that is, their query sets are generated
by metrics $d_i$, $i = 1, 2$. In order for a mapping $f\colon \Omega_1 \to \Omega_2$ with the
property $f(X_1) \subseteq X_2$ to determine a projective reduction $f\colon W_1 \to W_2$, it is
necessary and sufficient that $f$ be 1-Lipschitz: indeed, in this case every ball
$B^1_\varepsilon(x)$ will be mapped inside of the ball $B^2_\varepsilon(f(x))$ in $\Omega_2$.


Example 4.5. Pre-filtering is an often used instance of projective reduction. In the context of
similarity workloads, this normally denotes a procedure whereby a metric $\rho$ is replaced with a
coarser distance $d$ which is computationally cheaper. This amounts to the 1-Lipschitz map
$(\Omega, X, \rho) \to (\Omega, X, d)$.

Example 4.6. The same applies to quasi-metrics. Moreover, it is routine to have a quasi-metric,
$\rho$, replaced with a metric, $d$, having the property $\rho(x, y) \leq d(x, y)$, so that one does
not miss any hits. The usual choices are $d(x, y) = \max\{\rho(x, y), \rho(y, x)\}$, or else
$d(x, y) = \rho(x, y) + \rho(y, x)$, followed by a rescaling.


Example 4.7. A frequently used tool for dimensionality reduction of datasets is the famous
Johnson-Lindenstrauss lemma, cf. e.g. [?] or Sect. 15.2 in [?]. Let $\Omega = \mathbb{R}^N$ be a
Euclidean space of high dimension, and let $X \subset \mathbb{R}^N$ be a dataset with $n$ points. If
$\varepsilon > 0$ and $p$ is a randomly chosen orthogonal projection of $\mathbb{R}^N$ onto a linear
subspace of dimension $k = O(\log n)/\varepsilon^2$, then with overwhelming probability the mapping
$\sqrt{N/k}\, p$ does not distort distances within $X$ by more than the factor of
$1 \pm \varepsilon$.

The same is no longer true of the entire domain $\Omega = \mathbb{R}^N$, meaning that the technique
can be only applied to indexing for similarity search of the inner workload $(X, X, \mathcal{Q})$,
and not the outer workload $(\Omega, X, \mathcal{Q})$.


Example 4.8. A projective reduction of a metric space $\Omega$ to one of a smaller cardinality,
$\Omega'$, which in turn is equipped with a hierarchical tree index structure, is at the core of a
general paradigm of indexing into metric spaces developed in [?].


4.4. Illustration: our indexing scheme 

Our indexing scheme can be also interpreted in terms of projective reduction as in Example 4.3.
Denote by $\gamma$ the alphabet consisting of five groups into which the 20 amino acids have been
partitioned. Let $q\colon \Sigma \to \gamma$ be the map assigning to each amino acid the
corresponding group. This map in its turn determines the map $r = q^m\colon \Omega \to \Omega_\gamma$,
where $\Omega = \Sigma^m$ and $\Omega_\gamma = \gamma^m$. The direct image workload with domain
$\Omega_\gamma$, determined by the map $r$, can be indexed into using the binomial tree as in Example
3.7 to generate all bins that can intersect the neighbourhood of the query point. Denote this
indexing scheme by $\mathcal{I}$. Then the indexing scheme into $\Omega$, described in Subs. 3.4, is
just $r^*(\mathcal{I})$ as defined in Subs. 4.3.
Figure 6. Ratio between the sizes of metric and quasi-metric balls containing k nearest neighbours with respect
to quasi-metric. Each point is based on 5,000 samples.


5. Performance and geometry 

5.1. Access overhead 

Let $W_i = (\Omega_i, X_i, \mathcal{Q}_i)$, $i = 1, 2$, be two workloads, and let
$W_1 \xrightarrow{r} W_2$ be a projective reduction of $W_1$ to $W_2$. The relative access overhead
of the reduction $r$ is the function $P_r\colon \mathcal{Q}_1 \to [1, +\infty)$, assuming for each
query $Q$ the value $P_r(Q) := |r^{-1}(r^{\rightarrow}(Q)) \cap X_1| \,/\, |Q \cap X_1|$.
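The relative access overhead can be computed directly from this definition on a toy example. The bucket reduction $r(x) = \lfloor x/10 \rfloor$, the interval queries, the dataset and the function names below are assumptions for illustration; the value is undefined for queries with no hits, which the sketch does not handle.

```python
def relative_access_overhead(dataset, r, push_query, in_query, query):
    """Ratio of datapoints scanned through the reduction to true hits of the query."""
    hits = [x for x in dataset if in_query(x, query)]
    pushed = push_query(query)                          # a set of points of the reduced domain
    scanned = [x for x in dataset if r(x) in pushed]    # X ∩ preimage of the pushed query
    return len(scanned) / len(hits)


X1 = [3, 12, 18, 25, 31, 39, 47]


def r(x):
    """Reduction map: bucket of width 10."""
    return x // 10


def push_query(q):
    """All buckets that the interval [a, b] can meet."""
    a, b = q
    return set(range(r(a), r(b) + 1))


def in_query(x, q):
    a, b = q
    return a <= x <= b


# Buckets 1, 2, 3 are scanned (5 datapoints) to return the 3 hits 18, 25, 31.
print(relative_access_overhead(X1, r, push_query, in_query, (17, 34)))
```

An overhead of 1 means that the reduction scans nothing beyond the true answer; larger values measure the wasted work.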

Example 5.1. The values for relative access overhead of our indexing scheme for protein fragments
considered in terms of a projective reduction as in Subs. 4.4 can be easily obtained from Fig. 5.

Example 5.2. The access overhead of the projective reduction consisting in replacing a quasi-metric
with a metric (Example 4.6) can be very considerable. Fig. 6 shows the overhead in the case of our
dataset of fragments, where the quasi-metric $\rho$ is replaced with the metric
$d(x, y) = \max\{\rho(x, y), \rho(y, x)\}$. In our view, this in itself justifies the development of
theory of quasi-metric trees.

5.2. Concentration 

Let now $W = (\Omega, X, \mathcal{Q})$ be a similarity workload generated by a metric, $\rho$, on the
domain. Denote by $\mu$ the normalized counting measure supported on the instance $X$, that is,

$\mu(A) = |A \cap X| / |X|$    (2)

for an $A \subseteq \Omega$. This $\mu$ is a probability measure on $\Omega$.

The triples of this kind, $(\Omega, \rho, \mu)$, where $\rho$ is a metric and $\mu$ is a probability
measure on the metric space $(\Omega, \rho)$, are known as mm-spaces, or probability metric spaces,
and they form objects of study of geometry of high dimensions (asymptotic geometric analysis), see
[14], [16] and many references therein.
The central technical concept is that of the concentration function $\alpha_\Omega$ of an mm-space
$\Omega$: for $\varepsilon > 0$,

$\alpha_\Omega(\varepsilon) = 1 - \inf\{\mu(A_\varepsilon) : A \subseteq \Omega,\ \mu(A) \geq 1/2\}$,

where $A_\varepsilon$ denotes the $\varepsilon$-neighbourhood of $A$, and $\alpha_\Omega(0) = 1/2$.
If the intrinsic dimension of a triple $(\Omega, \rho, \mu)$ is high, the concentration function
$\alpha_\Omega(\varepsilon)$ drops off sharply near zero. Typically, the concentration function of a
probability metric space of dimension of order $d$ satisfies the Gaussian estimate
$\alpha_\Omega(\varepsilon) \leq C_1 \exp(-C_2 \varepsilon^2 d)$, where $C_1, C_2$ are suitable
constants. This observation is known as the concentration phenomenon.
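For a very small finite mm-space the concentration function can be computed by exhaustive search over all subsets of measure at least 1/2. The sketch below does this for the Hamming cube $\{0, 1\}^3$ with the uniform measure. It assumes the closed neighbourhood convention $A_\varepsilon = \{x : \rho(a, x) \leq \varepsilon \text{ for some } a \in A\}$; the function names are hypothetical, and the search is exponential in the number of points, so it is only a way to see the definition at work.

```python
from itertools import product


def hamming(x, y):
    """Hamming distance between two tuples of equal length."""
    return sum(a != b for a, b in zip(x, y))


def concentration(points, rho, eps):
    """alpha(eps) = 1 - min of mu(A_eps) over all A with mu(A) >= 1/2 (uniform measure)."""
    n = len(points)
    # Bitmask of the closed eps-neighbourhood of each single point.
    nbr = [sum(1 << j for j in range(n) if rho(points[i], points[j]) <= eps)
           for i in range(n)]
    smallest = n
    for mask in range(1, 1 << n):                # every non-empty subset A
        if 2 * bin(mask).count("1") < n:         # skip subsets with mu(A) < 1/2
            continue
        grown = 0
        for i in range(n):
            if (mask >> i) & 1:
                grown |= nbr[i]                  # union of neighbourhoods of points of A
        smallest = min(smallest, bin(grown).count("1"))
    return 1 - smallest / n


CUBE = list(product((0, 1), repeat=3))
print([concentration(CUBE, hamming, eps) for eps in (0, 1, 2)])  # [0.5, 0.125, 0.0]
```

The extremal set for $\varepsilon = 1$ is a Hamming ball of radius 1, whose neighbourhood misses only the antipodal point.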

The concentration function $\alpha$ is non-increasing, but need not be strictly monotone. For each
$x > 0$, denote $\alpha^{\leftarrow}(x) = \inf\{\varepsilon > 0 : \alpha(\varepsilon) < x\}$. The
following result is based on the same ideas as Lemma 4.2 in [?].


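Theorem 5.1. Let $(\Omega, \rho, \mu)$ be an mm-space, let $\xi > 0$ and let $\mathcal{B}$ be a
collection of subsets $B \subseteq \Omega$ such that $\mu(\bigcup \mathcal{B}) = 1$ and for all
$B \in \mathcal{B}$, $\mu(B) \leq \xi <$ […]. Set $\delta = \alpha^{\leftarrow}(\xi)$. Then, for any
$\varepsilon > \delta$,

1. There exists $\omega \in \Omega$ such that $B_\varepsilon(\omega)$ meets at least

$\min\{\,[\ldots],\ \lceil 1/\alpha(\varepsilon - \delta) \rceil - 1\,\}$

elements of $\mathcal{B}$.

2. A left ball $B_\varepsilon(\omega)$ around $\omega \in \Omega$ meets on average (in $\omega$) at
least

$\min\{\,[\ldots],\ 1/(4\alpha(\varepsilon - \delta))\,\}$

elements of $\mathcal{B}$.

Proof:

By assumption on each $B \in \mathcal{B}$ and by the choice of $\delta$,
$\mu(B) \leq \xi \leq \alpha(\delta)$. Decompose $\mathcal{B}$ into a collection of pairwise disjoint
subfamilies $\mathcal{B}_i$, $i \in I$, in such a way that
$\alpha(\delta) \leq \mu(A_i) \leq 2\alpha(\delta)$ for each $A_i = \bigcup \mathcal{B}_i$. Clearly,
[…].

Let $\delta' = \varepsilon - \delta > 0$, so that
$((A_i)_\delta)_{\delta'} \subseteq (A_i)_\varepsilon$. By Lemma 4.1 of [?],

$\mu((A_i)_\varepsilon) \geq \mu(((A_i)_\delta)_{\delta'}) \geq 1 - \alpha(\delta')$,

and hence the probability that a random left ball of radius $\varepsilon$ does not intersect $A_i$ is
less than $\alpha(\varepsilon - \delta)$. For any $J \subseteq I$,

$\mu\left(\bigcap_{i \in J} (A_i)_\varepsilon\right) \geq 1 - |J|\,\alpha(\varepsilon - \delta)$.

The first claim follows by choosing $J$ such that

$|J| = \min\{\,|I|,\ \lceil 1/\alpha(\varepsilon - \delta) \rceil - 1\,\} \geq \min\{\,[\ldots],\ \lceil 1/\alpha(\varepsilon - \delta) \rceil - 1\,\}$,

so that $\mu\left(\bigcap_{i \in J} (A_i)_\varepsilon\right) > 0$. To prove the second statement,
observe that the probability that a random ball of radius $\varepsilon$ meets at least
$1/(2\alpha(\varepsilon - \delta))$ elements is at least $1/2$. Hence, the average number of subsets
of $\mathcal{B}$ intersecting a ball of radius $\varepsilon$ is at least
$1/(4\alpha(\varepsilon - \delta))$. □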
Figure 7. Percentage of bins scanned to obtain k nearest neighbours. Based on 20000 searches for each k. The
query points were sampled with respect to the product measure based on amino acid frequencies.


This result directly leads to the following corollary stated in terms of a range similarity workload 
(with fixed radius). 


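Corollary 5.1. Let $W = (\Omega, X, \rho)$ be a metric similarity workload. Suppose the dataset $X$
and the query centres are distributed according to the Borel probability measure $\mu$ on $\Omega$.
Let $\mathcal{B}$ be a finite set of blocks such that $\mu(\bigcup \mathcal{B}) = 1$ and for any
$B \in \mathcal{B}$, $\mu(B) \leq \xi <$ […]. Then the number of blocks accessed to retrieve the
query $B_\varepsilon(\omega)$, where $\varepsilon > \alpha^{\leftarrow}(\xi)$, is on average at least
$1/(4\alpha(\varepsilon - \alpha^{\leftarrow}(\xi)))$, and in the worst case at least
$\lceil 1/\alpha(\varepsilon - \alpha^{\leftarrow}(\xi)) \rceil - 1$ or $1/(2\xi)$, whichever is
smaller.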


Example 5.3. In order to apply such estimates to a particular workload, one needs to determine its
concentration function. If one equips the dataset of peptide fragments with a metric as in Ex. 4.6,
then it is not difficult to derive Gaussian upper estimates for the concentration function
$\alpha_W(\varepsilon)$ using standard techniques of asymptotic geometric analysis. First, one
estimates the concentration function of $\Omega = \Sigma^{10}$ equipped with the product measure
using the martingale technique, and then one uses the way $X$ sits inside of $\Omega$ (the rate of
growth of neighbourhoods of the dataset, cf. Fig. 2). However, the bounds obtained this way are too
loose and do not lead to meaningful bounds on performance. One needs to learn how to estimate the
concentration function of a workload more precisely.

Fig. 7 shows the actual number of bins accessed by our indexing scheme in order to retrieve $k$
nearest neighbours for various $k$. Notice that both the number of bins and the number of points of
the dataset visited (Fig. 5) appear to follow the power law with exponent approximately $1/2$ with
respect to the number of neighbours retrieved.

For a conceptual explanation of this phenomenon, consider first the following example.


Example 5.4. The authors of [20] have introduced the distance exponent, which gives the intrinsic
dimension of a metric space with measure, by assuming that (at least for small $\varepsilon$) the
size of a ball $B_\varepsilon(x)$ grows proportionally to $\varepsilon^N$, where $N$ is the dimension
of the space. (This value is, essentially, an approximation to the Minkowski dimension of the
dataset.) They claimed that performance of metric trees could be well approximated in terms of the
distance exponent.

Figure 8. Growth of balls in the illustrative dataset.

Fig. 8 shows (on the log-log scale) the rate of growth of measure of balls $B_\varepsilon(\omega)$ in the illustrative
dataset of peptide fragments for the quasi-metric. The rate of growth in the most meaningful range of $\varepsilon$
for similarity search, and therefore the distance exponent of our dataset, can be estimated as being
between 10 and 11.
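As a rough illustration of how a distance exponent can be read off from ball growth, the sketch below estimates the empirical measure of a ball as the fraction of sampled distances not exceeding the radius, and then takes the log-log slope between two radii. This is a simplified two-point estimate under stated assumptions, not the procedure of the cited authors; the helper names `ball_measure` and `distance_exponent` and the synthetic distance sample are hypothetical.

```python
import math


def ball_measure(distances, eps):
    """Empirical measure of a ball: fraction of sampled distances <= eps."""
    return sum(1 for d in distances if d <= eps) / len(distances)


def distance_exponent(distances, eps_small, eps_large):
    """Two-point log-log slope of the ball measure between two radii."""
    m_small = ball_measure(distances, eps_small)
    m_large = ball_measure(distances, eps_large)
    return math.log(m_large / m_small) / math.log(eps_large / eps_small)


# Synthetic distances, constructed so that the ball measure is close to eps**3.
n = 10_000
distances = [(i / n) ** (1.0 / 3.0) for i in range(1, n + 1)]
estimate = distance_exponent(distances, 0.3, 0.6)
```

The two radii should be chosen inside the range that matters for the queries of interest, since the slope generally varies with the radius.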

Returning to Figure 0, clearly the graphs in question show the average growth of a ball in the
projective reduction of our workload, whose dataset is $q(X)$ (cf. Subs. 4.4), against the growth of the
ball of the same radius in the original space $(\Omega, X)$. Denote by $k$ the number of true neighbours
retrieved and by $V(k)$ the corresponding number of fragments scanned. The power relationship can be
written as $V(k) = O(k^{F})$ with $F \approx \frac{1}{2}$. If we accept the reasoning behind the distance exponent,
that is, that $k = O(r^{D})$ where $D$ is the “dimension” of the space of protein fragments, it follows that
$V(r) = O(r^{FD})$. Using the same reasoning about the size of the ball in the reduced workload, we
conclude that its “dimension” is $FD$, that is, the original dimension $D$ is reduced by the factor of
$F^{-1}$. Assuming that the values of the distance exponent do not depend on whether a quasi-metric or its
associated metric is used, and taking the values of the distance exponent estimated in Example 5.4, the
“dimension” of the reduced workload is somewhere between 5 and 5.5. Thus, our indexing scheme has
reduced the dimension by half.
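The arithmetic behind this estimate can be written out as follows. This is only a restatement of the heuristic above, treating the $O$-estimates as approximate proportionalities and taking $F \approx \frac{1}{2}$ from the observed power law; it is not a rigorous derivation.

```latex
\begin{align*}
V(k) &\approx c_1\, k^{F}, \qquad k \approx c_2\, r^{D}
  &&\text{(observed power law; distance exponent)}\\
V(r) &\approx c_1 \left(c_2\, r^{D}\right)^{F} = c_1 c_2^{F}\, r^{FD}
  &&\text{(substitution)}\\
FD &\in \left[\tfrac{1}{2}\cdot 10,\ \tfrac{1}{2}\cdot 11\right] = [5,\ 5.5]
  &&\text{(with } F \approx \tfrac{1}{2},\ 10 \le D \le 11\text{)}
\end{align*}
```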


5.3. Concentration and certification functions 

Let $f\colon \Omega \to \mathbb{R}$ be a 1-Lipschitz function. Denote by $M$ the median value of $f$. In asymptotic
geometric analysis it is well known (and easily proved) that

$$\mu\{\omega : |f(\omega) - M| > \varepsilon\} \leq 2\alpha_\Omega(\varepsilon),$$

that is, if $\Omega$ is high-dimensional, the values of $f$ are concentrated near one value. If one sees such
functions as random variables respecting the distance, the concentration phenomenon says that on a
high-dimensional $\Omega$, the distribution of $f$ peaks near one value. Using such $f$ as certification
functions in an indexing scheme leads to a massive amount of branching and the dimensionality curse.

Figure 9. Distributions of distances from 40,000 random points to a typical point (SEDRELLTEQ) in $\Omega$ and of
distances to a bin (the one containing the above fragment). Axis label: DISTANCE-MEDIAN.
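The concentration effect is easy to observe numerically on a toy space. The sketch below uses the Hamming cube $\{0,1\}^n$ with the normalized Hamming distance and the uniform measure, and takes as the 1-Lipschitz function the distance to the all-zeros point. The choice of space, the function names and the sample sizes are assumptions made for illustration; they are not the peptide workload of the paper.

```python
import random
import statistics


def normalized_hamming_to_origin(bits, n):
    """Normalized Hamming distance from an n-bit point to the all-zeros point."""
    return bin(bits).count("1") / n


def spread_of_distance(n, samples, rng):
    """Population standard deviation of the distance over random n-bit points."""
    values = [
        normalized_hamming_to_origin(rng.getrandbits(n), n)
        for _ in range(samples)
    ]
    return statistics.pstdev(values)


# Fixed seed so that repeated runs give the same numbers.
rng = random.Random(0)
spreads = {n: spread_of_distance(n, 2000, rng) for n in (10, 100, 1000)}
```

In this toy setting the spread of the distance function shrinks roughly like $1/(2\sqrt{n})$ as $n$ grows, which is the kind of behaviour that makes distances from points poor at separating a high-dimensional dataset.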

Yet there are reasons to believe that the principal cause of the curse of dimensionality is not the
inherent high-dimensionality of datasets, but a poor choice of certification functions. Efficient indexing
schemes require the use of dissipating functions, that is, 1-Lipschitz functions whose values are more
widely spread and which are still computationally cheap. This interplay between complexity and
dissipation is, we believe, at the very heart of the dimensionality curse.

Example 5.5. One possible reason for the relative efficiency of our quasi-metric tree may be a good choice
of certification functions, which are somewhat less concentrated than distances from points (Fig. 9).


6. Conclusions 

Our proposed approach to indexing schemes used in similarity search allows for a unifying look at them 
and facilitates the task of transferring the existing expertise to more general similarity measures than 
metrics. In particular, we propose the concept of a quasi-metric tree based on a new notion of left 1- 
Lipschitz functions, and implement it on a very large dataset of peptide fragments to obtain a simple yet 
efficient indexing scheme. 

We hope that our concepts and constructions will meld with methods of geometry of high dimensions 
and lead to analysis of performance of indexing schemes for similarity search. While we have not yet 
reached the stage where asymptotic geometric analysis can give accurate predictions of performance, at 
least it leads to some conceptual understanding of their behaviour. 

We suggest using our dataset of protein fragments as a simple benchmark for testing indexing 
schemes for similarity search. 